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Abstract 

We present a new lossy compressor for discrete sources. For coding a source sequence x", the encoder starts 
by assigning a certain cost to each reconstruction sequence. It then finds the reconstruction that minimizes this 
cost and describes it losslessly to the decoder via a universal lossless compressor. The cost of a sequence is given 
by a linear combination of its empirical probabilities of some order k + 1 and its distortion relative to the source 
sequence. The linear structure of the cost in the empirical count matrix allows the encoder to employ a Viterbi-like 
algorithm for obtaining the minimizing reconstruction sequence simply. We identify a choice of coefficients for 
the linear combination in the cost function which ensures that the algorithm universally achieves the optimum 
rate-distortion performance of any Markov source in the limit of large n, provided k is increased as o(logn). 

I. Introduction 

Let X = {Xi : i > 1} represent a discrete- valued stationary ergodic process with unknown statistics, 
and consider the problem of compressing X at rate R such that the incurred distortion is minimized. Let 
X and X denote finite source and reconstruction alphabets respectively. The performance of the described 
coding scheme is measured by its average expected distortion between source and reconstruction blocks, 
i.e. 

1 " 

D = Eci„(X",X") ^-J2^d{Xi,X,), (1) 

1=1 

where d : X x X ^ R+ is a single-letter distortion measure. For any R > 0, the minimum achievable 
distortion (cf. [4] for exact definition of achievability) is characterized as [1], [2], [3] 

L'(X,i?)=lim min Ec/„(X",X"). (2) 

A sequence of codes at rate R is called universal if for every stationary ergodic source X its asymptotic 
performance converges to -D(X, i?), i.e., 

limsupErf„(X",X") < D{X,R). (3) 

n— >oo 

For lossless compression where the source is to be recovered without any errors, there already exist 
well-known implementable universal schemes such as Lempel-Ziv coding [5] or arithmetic coding [6]. In 
contrast to the situation of lossless compression, for D > 0, there are no well-known practical schemes that 
universally achieve the rate-distortion curve. In recent years, there has been progress towards designing 
universal lossy compressor especially in trying to tune some of the existing universal lossless coders to 
work in the lossy case as well [7], [8], [9]. All of these algorithms are either provably suboptimal, or 
optimal but with exponential complexity. 

Another approach for lossy compression, which is very well-studied in the literature and even imple- 
mented in JPEG 2000 image compression standard, is Trellis coded quantization, i.e. Trellis structured 
code plus Viterbi encoding (cf. [10], [11] and references therein). This method is in general suboptimal 
for coding sources that have memory [11]. In [12], an algorithm for fixed-slope Trellis source coding is 



proposed, and is shown to be able to get arbitrary close to the rate-distortion curve for continuous-valued 
stationary ergodic sources. The proposed method is efficient in low rate region. 

In a recent work [13], a new implementable algorithm for lossy compression of discrete- valued stationary 
ergodic sources was proposed. Instead of fixing rate (or distortion) and minimizing distortion (or rate), 
the new algorithm fixes Lagrangian coefficient a, and minimizes R + aD. This is done by assigning 
energy £^(y") representing R + aD to each possible reconstruction sequence and finding the sequence that 
minimizes the cost by simulated annealing. The algorithm starts by letting = x", and at each iteration 
chooses an index i E {1, . . . ,n} uniformly at random, and probabilistically changes yi to some y E X 
such that there is a positive probability (which goes to zero as the number of iterations increases) that the 
resulting sequence has higher energy than the original sequence. Allowing the energy to increase especially 
at initial steps prevents the algorithm from being entrapped in a local minimum. It was shown that using 
a universal lossless compressor to describe the reconstruction sequence resulting from this process to the 
decoder results in a scheme which is universal in the limit of many iterations and large block length. The 
drawback of the proposed scheme is that although its computational complexity per iteration is independent 
of the block length n and linear in a parameter kn = o(log?7,), there is no useful bound on the number 
of iterations required for convergence. In this paper, inspired by the previous method, we propose yet 
another approach for lossy compression of discrete Markov sources which universally achieves optimum 
rate-distortion performance for any discrete Markov source. We start by assigning the same cost that 
was defined for each possible reconstruction sequence in [13]. The cost of each sequence is a linear 
combination of two terms: its empirical conditional entropy and its distance to the source sequence to be 
coded. We show that there exists proper linear approximation of the first term such that minimizing the 
linearized cost results in the same performance as minimizing the original cost. But the advantage is that 
minimizing the modified cost can be done via Viterbi algorithm in lieu of simulated annealing which was 
used for minimizing the original cost. 

The organization of the paper is as follows. In Section |IIl we set up the notation, and define the 
count matrix and empirical conditional entropy of a sequence. Section UlI] describes a new coding scheme 
for fixed-slope lossy compression which universally achieves the rate-distortion curve for any discrete 
Markov source and |IV] describes how to compute the coefficients required by the algorithm outlined in 
the previous section. Section |V] explains how Viterbi algorithm can be used for implementing the coding 
scheme described in Section Hill Section |VI] presents some simulations results, and finally. Section IVIII 
concludes the paper with a discussion of some future directions. 

Proofs that are not presented in the paper will appear in the full version. 

II. Notations and required definitions 
Let X and X denote the source and reconstruction alphabets respectively. Let matrix m{y^) E R''^' x 
RI'^I'^ represent [k + ly^ order empirical count of defined as 

m^Avn = ^\{l<t<n: y^l = h,y, = P]}\. (4) 

In (HI), and throughout we assume a cyclic convention whereby yi = yn+i for i < 0. Let Hk{y^) denote 
the conditional empirical entropy of order k induced by y", i.e. 

Hk{yn=HiYk+i\Y''), (5) 

where on the right hand side of ^ is distributed according to 

P(F'=+i = [b,/3])=m^,b(y"), (6) 



where P E X, and b G X'^, and [b, P] represents the vector made by concatenation of b and p. We will 
use the same notation throughout the paper, namely, P, P', . . . E X, and b, b', . • • G X^. The conditional 
empirical entropy in ^ can be expressed as a function of m(y") as follows 

=if,(m(y")) := ^5^7^(m,b(y"))l^m,b(2/"), (7) 

^ /3,b 

where 1 and m. b(y") denote the all-ones column vector of length \X\, and the column in m(y") 
corresponding to b respectively. For a vector v = {vi, . . . ,vi)^ with non-negative components, we let 
7i(v) denote the entropy of the random variable whose probability mass function (pmf) is proportional 
to V. Formally, 

{ ifv=(0,...,Of. 

III. Linearized cost function 

Consider the following scheme for lossy source coding at fixed slope a > 0. For each source sequence 
let the reconstruction block be 

= arg min [H^iy'^) + «rf„(x", y^)] . (9) 

The encoder, after computing x", losslessly conveys it to the decoder using LZ compression. Let k grow 
slowly enough with n so that 



lim sup max 



-4z(y")-i/fc(2/") 

n 



< 0, (10) 



where iiziv"') denotes the length of the LZ representation of y". Note that Ziv's inequality guarantees that 
if k = kn = o(logn) then (flOl) holds. 

Theorem 1: [13] Let X be a stationary and ergodic source, let R(K, D) denote its rate distortion 
function, and let X" denote the reconstruction using the above scheme for coding X". Then 



E 



-4z(X") + ad„(X",X") 
n 



mm\R(X,D) + aD]. (11) 

D>0 



In other words, conveying the reconstruction sequence to the decoder via universal lossless compression 
(selection of LZ algorithm here is for concreteness, but other universal lossless methods can be used as 
well) achieves optimum fixed-slope rate-distortion performance universally. 

As proposed in [13], the exhaustive search required by this algorithm can be tackled through simulated 
annealing Gibbs sampling. Here assuming the source is a discrete Markov source, we propose another 
method for finding a sequence achieving the minimum in ([9]). The advantage of the new method is that 
its computational complexity is linear in n for fixed k. 

Before describing the new scheme, consider the problems (PI) and (P2) described below. 



(PI): min [i7fc(m(y")) + «t/„(x", y")] 



and 



(P2) : min 



/3 b 



(12) 



(13) 



Comparing (PI) with ^ reveals that it is the optimization required by the exhaustive search coding 
scheme described before. The question is whether it is possible to choose a set of coefficients {\^^h}f3,h, 
(3 E X and b e X'', such that (PI) and (P2) have the same set of minimizers or at least, the set of 
minimizers of (P2) is a subset of minimizers of (PI). If the answer to this question is affirmative, then 
instead of solving (PI) one can solve (P2), which, as we describe in Section |Vl can be done simply via 
the Viterbi algorithm. 

Let Si and ^2 denote the set of minimizers of (PI) and (P2). Consider some G Si, and let m* = 
m(z"). Since H{m) is concave in mfl for any empirical count matrix m, we have 

H{m) < H{m*^ + ^_if(m)U* (m^,b - m^^) (14) 

^i^(m). (15) 
Now assume that in (P2), the coefficients are chosen as follows 

Lemma 1: (PI) and (P2) have the same minimum value, if the coefficients are chosen according to 
(fT6l) . Moreover, if all the sequences in Si have the same type, then Si = S2. 
Proof: For any y'^ E X"-, 



H{xa{y^)) + ad„(a:^ y") < i/(m(y")) + ad„(a:", y^). (17) 



Therefore, 



min[i7(m(y")) + arf„(x^|/")] < mm[H{in{y^)) + arf„(x^y")] (18) 

yn yn 

< i7(m(^")) + aci„(x",z") (19) 

= min[i7(m(2/")) + y")]. (20) 

y" 

This shows that (PI) and (P2) have the same minimum values. For any sequence with m(|/") ^ m*, 
by strict concavity of H{m), 

H{m{y")) + adn{x\y'') > /7(m(y")) + y"), (21) 

> mm[H{in{y^)) + y")]. (22) 

As a result all the sequences in ^2 should have the empirical count matrix equal to m*. Since for these 
sequences H{m) = H{m), we also conclude that ^2 C Si. If there is a unique minimizing type m*, then 

Si = S2. u 

This shows that if we knew the optimal type m*, then we could compute the optimal coefficients via 
(fT6l) . and solve (P2) instead of (PI). The problem is that m* is not known to the encoder (since knowledge 
of m* requires solving (PI) which is the problem we are trying to avoid). In the next section, we describe 
a method for approximating m*, and hence the coefficients {A;3_b}- 



'As proved in Appendix B. 



IV. How TO CHOOSE THE COEFFICIENTS? 

For a given stationary ergodic source X, and for any given count matrix m define D{ni) to be the 
minimum average expected distortion among all processes Y that are jointly stationary ergodic with X 
and their {k + 1)**^ order stationary distribution is according to m|l D{m) can equivalently be defined as 

D{m) = lim min Ep d{x''\ y''') , (23) 

fei— »oo p(x'=i,j/'=i)gA1('=i) 

where Ai^'''^^ is the set of all jointly stationary distributions p{x^^,y^^) of {X^^.Y'^^) with marginal 
distributions with respect to x coinciding with the k}^' order distribution of X process, and with marginal 
distributions with respect to y coinciding with m, i.e., having the {k + 1)**^ order marginal distribution 
described by m. 

Lemma 2: If the source is order Markov, then 

D{m)= min Epd{x^\y^^), (24) 

where ki = max(£, A; + 1). 

Proof: [outline] Using the technique described in Appendix A, for any legitimate given joint distri- 
bution p{x''^, y^^) with the marginal distribution with respect to x coinciding with the source distribution 
and with marginal distribution with respect to y coinciding with some given distribution m, it is possible 
to construct a process which is jointly stationary and ergodic with our source process and also has the 
{k+ 1)*^ order joint distribution as p{x'^^ ,y'^^). Using this gives us an achievable distortion, i.e., an upper 
bound on D{m). On the other hand, the limit given in (1231) is approaching D{m) from below. Combining 
the upper and lower bounds yields the desired equality. ■ 
Since by assumption the encoder does not know therefore it can not compute max(^, /c + 1). But 
letting ki = k -\- 1, where k = o(logra), for any fixed order i, ki will eventually for n large enough, 
exceed i, and hence be equal to max(£, k + 1). Having this observation in mind, consider the following 
optimization problem, 

min H(m) + aD(m) 
s.t. m e M^^'\ (25) 
By Lemma |2l an equivalent representation of (l25l) is 

min H{ni) + a dk,{(3'h' , f3h)p,{(3'h')qy\,{f3h\f3'h') 

/3,/3',b,b' 

S.t. m^,b =5^i?.(/5'b')g,|,(/3b|/3'b'), V /3, b, 

< g,|.(/?b|/?'b') < 1, V/3,/?',b,b', 
J2QyUPWh') = l, V/5',b', 

/3,b 

Y,PAP'^')qyUPh\pV)=J2pA^'P')QyU^P\^'P') Vb,b'. (26) 

The last constraint in (|26l ) is the stationarity condition defined in (lA-ll) . and ensures that the joint 
distribution defined by px{Ph)qy\x{l3'h'\[3,h') over [x^^^ ,y^^^) corresponds to {k + 1)*"^ order marginal 
distribution of some jointly stationary processes (X, Y). Note that the variables in (|26|) are conditional 
distributions qy\x{y^^\x^^), but we are only interested in the m that they induce. 



^As discussed in Appendix A, tlie set of such processes is non-empty for any legitimate m. 



Lemma 3: If for each n, (PI) has a unique minimizing type m*, then 



- mnlkv ^ 0, a.s., (27) 

where m* is the solution of (l26l) . 

Remark: In (|26|) . the only dependence on n is through A;i. 

Therefore, if the encoder knew the distribution of the source, it could solve (|26l) . find a good approx- 
imation of m*, and then use (fT6l) to compute the coefficients required by (P2). The problem is that the 
encoder does not have this information, and only knows that the source is Markov (but does not know 
its order). To overcome its lack of information, a reasonable step is to use empirical distribution of the 
source instead of the true unknown distribution in (|26|) . For a!'^ G A'^'^ define the k^^ order empirical 
distribution of the source as 

p'i^\a'^) ^ \{'--(^^-f^.^---^^^-^) = ^''}\ _ (28) 

The following lemma shows that for ki = o(logn), p^'^'^^ converges to the actual kf order distribution of 
the source, and therefore can be considered as a good approximation for it. 
Lemma 4: For ki = o(logn), and any stationary ergodic Markov source, 

ll^(fei) ^ a.s., (29) 

where p^'^'^^ is the true kf" order distribution of the Markov source. 

Assume is generated by a discrete Markov source, and let px be its empirical distribution defined 
in (|28l) . Consider the following optimization problem 

min H{ui)+a 4i(/9'b', /3b)pi'=^)(/5'b')g,|,(/3b|/5'b') 

/3,/3',b,b' 

s.t. m^,b =^p^)(/5'b')g,|.(/5b|/5'b'), V/5,b, 

/3'h' 

0<qyU(3h\P'h')<l, V/3,/3',b,b' 
5^g,|.(/3b|/?'b') = l, V/3',b', 

I3,h 

J2pi''\(^'^')Qy\.{f3h\f3'h') = 5^)p('=^)(b'/?')g,|.(b/3|b'/5'), Vb,b'. (30) 

and let m* denote the output of the above optimization problem. 
Lemma 5: For ki = ki{n) = o(logr;,), 

l|mn-mnllTv^O, a.s. 
Proof: [outline] The input parameters of the optimization problem (l30l) are {p^''^\a''^)}akiexf'i^ 
therefore rh* = ui!^({p^''^\a^^)}^k-^^^kj). On the other hand, both the cost function and the constraints 
of (l30l) are continuous both in input parameters and optimization variables. This means that m* in turn 
is a continuous function of {pix''^)}^ki^x'=i- ■ 
Let {A/3 b(?^)}/3,b denote the optimal values of the coefficients defined at m* (as given in (fT6l)). and let 
{A/3 b('^)}/3,b be coefficients computed at m*, then 
Lemma 6: 

max \X^^h{n) — A/3 b(?T-)| ^ as n ^ oo. (31) 

/3,b 

These results suggest that for computing the coefficients we can solve the optimization problem given 
in (|30l) (whose complexity can be controlled with the rate of increase of ki), and then substitute the result 



in (fT6l) to obtain the approximate coefficients. After that (P2) defined by these coefficients can be solved 
using the Viterbi algorithm in a way that will be detailed in the next section. The succession of lemmas 
detailed in the previous sections then allow us to prove the following theorem. 

Theorem 2: Let X let a stationary and ergodic Markov source, and i?(X, D) denote its rate distortion 
function. Let X" be the reconstruction sequence obtained using the above scheme for coding X" choosing 
ki = k + 1, where k = o(logn). Then 



min \R(X, D) + aD] . (32) 

D>0 



E iiffc(m(X")) + ad„(X",X" 

Remark: Theorem |2l implies the fixed-slope universality of the scheme which does the lossless com 
pression of the reconstruction by first describing its count matrix (costing a number of bits which is 
negligible for large n) and then doing the conditional entropy coding. 

V. Viterbi coder 

As proved in Section Hill instead of solving (PI), one can solve (P2) for proper choices of coefficients 
{Ab,/?}- Note that 



1 " 



n 

/3,b 1= 



(33) 



This alternative representation of the cost function suggests that instead of using simulated annealing, we 
can find the sequence that minimizes the cost function by the Viterbi algorithm. For i = A; + 1, . . . , n, let 
Si = yl_k be the state at time i, S be the set of all 2'"+^ possible states, and for s = 6*^"*"^ define 

w{s,i) := \hk+i,bk + ad{xi,bk+i). 

From our definition of the states Sj = g{si^i,yi), where g : S x X ^ S. This representation leads 
to a Trellis diagram corresponding to the evolution of the states {sj}™^.,^]^ in which each state has \X\ 
states leading to it and \X\ states branching from it. Assume that weight w{si,i) is assigned to the edge 
connecting states and Sj, i.e., the cost of each edge only depends on the tail state. 

It is clear that in our representation, there is a 1-to-l correspondence between sequences G Af" and 
sequences of states {si}'^i^_^_i, and minimizing (1331 ) is equivalent to finding the path of minimum weight 
in the corresponding Trellis diagram, i.e., the path that minimizes XlILfc+i ^'(-^^i 0- Solving this 

minimization can readily be done by Viterbi algorithm which can be described as follows. For each state 
s, let C{s) be the two states leading to it, and for any i > 1, 

C{s, i) := min \w{s, i) + C{s', i - 1)]. (34) 
s'e£(s) 

For i = 1 and s = b'^'^^, let C{s, 1) := Afe^^^^ + adk+i{x'^~^^ , b'''^^) . Using this procedure, each state s 
at each time j has a path of length j — k — 1 which is the minimum path among all the possible paths 
between i = k + 1 and i = j such that sj = s. After computing {C{s, i)} ses , at time i = n, let 

ie{k+l,...,n} 

s* = argmin C(s, ra). (35) 

ses 

It is not hard to see that the path leading to s* is the path of minimum weight among all possible paths. 
Note that the computational complexity of this procedure is linear in n but exponential in k because 
the number of states increases exponentially with k. 



0.75 




Fig. 1. {dn{x"' , x") , Hk.{x")) of output points of Viterbi encoder when the coefficients are computed at m[3;"]. For each value of a, the 
algorithm is run L — 20 times. Here ?i — 5000, k = 7, and the source is binary Markov with q — 0.2 



VI. Simulation results 

In this section, some preliminary simulation results of the application of Viterbi encoder described in the 
previous section is presented. In our simulations, instead of computing the coefficients {Xp^h} from (fT6l) 
at the optimal point m*, we compute them at the count matrix of the input sequence x", m(x"). Fig. [T] 
demonstrates ((i„(x", y"), iJ/c(m(?/"))) of output points of the described algorithm. The block length is 
n = 5000, k = 7 and the source is order binary symmetric Markov with transition probability q = 0.2. 
For each value of a the algorithm is applied to L = 20 different randomly generated sequences. The 
reason of getting some points below the rate-distortion curve is that the actual number of bits required 
for describing x" losslessly to the decoder is larger than nHk{x^), but converges to it as n grows. For 
example, for the simple scheme of separately describing the subsequences corresponding to different 
preceding contexts, this surplus is of order 2'^logn/n. The effect of this excess rate is not reflected in 
the figure, which explains why some points appear below the rate-distortion curve. 

It can be observed that for larger values of a the output points are closer to the curve. The reason is 
that large values of a correspond to small values of distortion, and if the distortion is small then m(a;") 
is a good approximation of m(y"). 

Finally, Fig. [2l compares the performance of the new Viterbi encoder and the MCMC encoder described 
in [13]. Here the source is again binary symmetric Markov with q = 0.2, and the other parameters are: 
k = 7, n = 5, 000, Pt = nlogt, r = lOn, where Pt determines the cooling schedule of the MCMC coder 
and r is its number of iterations. Each point is the figure corresponds to the average performance of 
L = 10 random realizations of the source. It can be observed that even for this simplistic choice of the 
coefficients the performance of the algorithms are comparable, while the Viterbi encoder for example in 
this example runs at least 40 times faster. 

VII. Conclusions and Current Directions 

In this paper, a new method for universal fixed-slope lossy compression of discrete Markov sources was 
proposed. The new method achieves the rate-distortion curve for any discrete Markov source. Extending 




the algorithm to work on any stationary ergodic source is under current investigation. We believe that in 
fact the same algorithm works for the general class of stationary ergodic sources, and only the proof should 
be extended to work in this case as well. Another direction for future work is finding a simple method for 
approximating the optimal coefficients that would alleviate the need for solving the optimization problem 

APPENDIX A: Stationarity condition 

Assume that we are given a \X\ x matrix m with all elements positive and summing up to one. 
The question is under what condition(s) this matrix can be {k + 1)*'^' order stationary distribution of a 
stationary process. For the ease of notations, instead of matrix m consider p(x''~^^) as a distribution defined 
on X''^^. We show that a necessary and sufficient condition is the so-called stationarity condition which 
is 

5^p(/3x'=) = (A-1) 

/3ex pax 

- Necessity: The necessity of (lA-ll) is just a direct result of the definition of stationarity of a process. 
If p{x^'^^) is to represent the {k + 1)*'^ order marginal distribution of a stationary process, then it 
should be consistent with the k^^ order marginal distribution as satisfy (lA-ll) . 

- Sufficiency: In order to prove the sufficiency, we assume that (lA-lj) holds, and build a stationary 
process with [k + 1)*'^ order marginal distribution of Consider a k^^ order Markov chain 
with transition probabilities of 

Note that is well-defined by (lA-ll) . Moreover, again from (lA-ll) . is the stationary 



distribution of the defined Markov chain, because 

Xl Xl 

Therefore we have found a stationary process that has the desired marginal distribution. 
Finally we show that if m is the count matrix of a sequence y", then there exist a stationary process 
with the marginal distribution coinciding with m. From what we just proved, we only need to show that 
(|A-1|) holds, i.e.. 

But this is true because both sides of (IA-41) are equal to \{i : y]^\ = h}\/n. 

APPENDIX B: Concavity of H{m) 
For simplicity assume that X = X = {0,1}. By definition 

H{ui)= V (mo,b + mi,b)M ), (B-1) 

where h{a) = a log a + a log a and a = 1 — a. We need to show that for any 9 E [0, 1], and empirical 
count matrices m'^^'' and m^'^\ 

eH{m^^'^) + eH{m^^^) < H{9m^^^ + ^m^^)). (B-2) 

From the concavity of h, it follows that 

(1) (2) 
(1) W\ht "^O'b ^ at (2) (2)n/,/ "^0,b X 



where 6^1 = 1 — ^2 = ^- Now summing up both sides of (IB-31) over all b G X'', yields the desired result. 
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